# Image Understanding

Qwen2 VL 2B GGUF

Qwen2-VL-2B is a vision-language model; this release provides it in quantized GGUF format.

Transformers English

Llava Critic 7b Hf

A Transformers-compatible vision-language model with image understanding and text generation capabilities.

LLaVA-Saiga-8b is a vision-language model (VLM) built on the IlyaGusev/saiga_llama3_8b model, optimized primarily for Russian-language tasks while retaining English capabilities.

Transformers Supports Multiple Languages

Llava Calm2 Siglip

llava-calm2-siglip is an experimental vision-language model capable of answering questions about images in Japanese and English.

Transformers Supports Multiple Languages

Paligemma 3B Chat V0.2

A multimodal dialogue model fine-tuned from google/paligemma-3b-mix-448 and optimized for multi-turn conversation.

Transformers Supports Multiple Languages

Paligemma Vqav2

This model is a fine-tuned version of google/paligemma-3b-pt-224 on a subset of the VQAv2 dataset, specializing in visual question answering tasks.

Llava Llama 3 8b V1 1 GGUF

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.

Llava Phi 3 Mini Hf

A LLaVA model fine-tuned from Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.

Blip Finetuned Fashion

A visual question answering model fine-tuned from Salesforce/blip-vqa-base, specializing in the fashion domain.

Eris PrimeV3 Vision 7B

Eris Prime V2 is a 7B-parameter multimodal language model with vision capabilities that requires KoboldCpp to run.

ChaoticNeutrals

Candle Llava V1.6 Mistral 7b

LLaVA is a vision-language model capable of understanding and generating text related to images.

TeCoA is a vision-language model initialized from OpenAI CLIP and made more robust through supervised adversarial fine-tuning.

Llava V1.6 Vicuna 13b Gguf

LLaVA is an open-source multimodal chatbot based on the Transformer architecture. Several quantized versions are offered, each trading off model size against output quality.

Ggml Llava V1.5 7b

LLaVA is a vision-language model capable of understanding and generating text related to images.

Pix2struct Vizwizvqa Base

A visual question answering model released under the Apache-2.0 license, with English-language support.

Transformers English

Llava V1.5 13B GPTQ

LLaVA v1.5 13B is a multimodal model developed by Haotian Liu that combines vision and language capabilities to understand images and text and generate content from them.

Mplug Owl Llama 7b

mPLUG-Owl is a multimodal large language model based on the LLaMA-7B architecture, supporting image understanding and text generation tasks.

Transformers English

Taiyi BLIP 750M Chinese

A model that converts image content into text descriptions, with Chinese-language support.

Text Recognition

Transformers Chinese

A BEiT base model fine-tuned on an unknown dataset; specific use cases and performance details are currently unavailable.

Large Language Model

Upernet Convnext Large

UperNet is a semantic segmentation framework, paired here with a ConvNeXt large backbone network, that predicts a semantic label for every pixel.

Image Segmentation

Transformers English

© 2025 AIbase